One of the intriguing problems in Community-based Question
Answering (CQA) research is the automatic identification of
the best answer, which is expected to benefit various
stakeholders. First of all, since several answers are provided
for each question, the readers of these websites will be able
to process the candidate answers more efficiently and mitigate
the “information overload” phenomenon. Secondly, a mechanism
that identifies high quality answers will increase awareness
within the community and will help to put more effort into
questions that remain poorly answered. For instance, in
StackOverflow^1 (SO) alone, as of September 2013, we found
that approximately 33% of the questions have yet to be marked
as resolved (i.e., out of the 5 million, 1.7 million questions
have no answer marked as “accepted”).

Researchers in related fields have used lexical, syntactic,
and discourse features to produce a predictive model of
readers’ judgments [3]. In several cases, the use of shallow
features, i.e. features that do not employ semantic or
syntactic parsing, such as sentence length or word length, has
been shown to be effective in assessing properties such as
ease of reading or usefulness. However, with respect to CQA,
research efforts towards the exploitation of shallow features
report relatively low results. To improve the efficacy of their
models, researchers turn to more contextual information, such
as the score of each answer, the comments received or the
reputation of the user [1]. However, these features may not be
readily available since a) comments and scores introduce an
inherent delay, and b) features based on reputation may not be
applicable to a newly formed community, or may pose a threat to
its development (i.e. preferential attachment) and result in
the reinforcement of the pre-existing community hierarchy.

In our approach, we revisit the case of shallow linguistic
features and use features found in [5]. Figure 1 shows the
average feature values for the accepted answers together with
the non-accepted ones of SO using a one-month time window^2.
As seen from the figure, the linguistic features clearly
differentiate the accepted from the non-accepted answers. More
specifically, accepted answers tend to be longer, use a less
common vocabulary, and contain longer words, more words per
sentence and lengthier longest sentences. Even though these
observations look promising for best answer prediction, a
binary classifier trained on these features performs weakly
(58% precision and 0.56 F-Measure on average for all
StackExchange (SE) websites). A more thorough investigation of
this poor performance leads us to identify two main issues.
Firstly, as illustrated in Figure 1, the characteristics of
language evolve over time. Secondly, while a steady gap between
the average values of accepted and non-accepted answers
endures, the posts are inherently diverse and their feature
values show a large variance. In addition, a cross-examination
of absolute values between different SE websites has shown us
that language characteristics differ significantly across SE
websites. Since the results that we obtained for a
classification based on shallow features are comparable to
similar approaches (e.g. [1]), these results will constitute
our baseline for evaluating the proposed solution.

^1 http://stackoverflow.com/

^2 Similar behaviour is identified for all StackExchange
websites and is omitted due to space limitations.

Maria Liakata
Dept. of Computer Science
University of Warwick
Coventry, UK
m.liakata@warwick.ac.uk

[Figure 1 legend: Accepted Answers; Non-Accepted Answers]

Figure 1: Activity and values of the linguistic features
(y-axis) for the StackOverflow dataset over time (x-axis). The
top-left sub-plot shows the number of answers posted every
month. The remaining sub-plots show the average values for the
accepted and non-accepted answers.
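
To make the notion of shallow features concrete, the following
is a minimal sketch of how a few of the properties named above
(answer length, word length, words per sentence, longest
sentence) might be computed. The exact feature definitions are
those of [5] and are not reproduced here; the function name,
the regular expressions used to split words and sentences, and
the choice to measure length in characters are all assumptions
made for illustration.

```python
import re

WORD = r"[A-Za-z0-9']+"  # assumed tokenisation rule, not taken from [5]


def shallow_features(text):
    """Compute a few shallow (parse-free) features of an answer's text."""
    # Naive sentence split on terminal punctuation (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(WORD, text)
    words_in_each = [len(re.findall(WORD, s)) for s in sentences]
    return {
        # Length measured in characters (an assumption).
        "length": len(text),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        # Longest sentence measured in words.
        "longest_sentence": max(words_in_each, default=0),
    }


features = shallow_features(
    "Use a list. It keeps insertion order and allows duplicates."
)
```

None of these values requires syntactic or semantic parsing,
which is what makes them cheap to obtain as soon as an answer
is posted.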

1. FEATURE DISCRETISATION 

Our solution, called discretisation, is presented in detail in
[2]. It proposes a novel way of leveraging shallow features
that overcomes the above limitations. Intuitively, our approach
is to treat the collection of answers for each question as an
information unit which can improve the training process.
Instead of treating each answer independently of the other
answers it is competing with, we assess the value of the
features of each answer in relation to the corresponding
features of its competitors. We introduce a new set of features
that stem from the linguistic features used so far: instead of
dealing with continuous values, these new features are the
result of grouping, sorting, and discretisation.

Table 1: Example of feature discretisation for the case of
Length, 5 submitted answers and 2 questions. Column Question Id
refers to the question under which the answer is submitted.

Question Id   Answer Id   Length   Length_D
1             2           200      2
1             3           150      3
1             4           250      1
5             6           250      1
5             7           200      2

We will present an example for the Length feature. Let us
consider the example of Table 1, where for one question there
are two candidate answers (i.e., the question with Id 5 has
answers with Id 6 and 7). We have already shown (Figure 1)
that the longer an answer is, the more likely it is to be
accepted. In order to represent this preference, we group all
answers by their corresponding questions (grouping). For each
group, we then sort the answers (sorting) and assign a rank to
each answer, starting from 1 and incrementing this rank by 1
(discretisation). Sorting is done in either descending or
ascending order, chosen so that the lowest rank is assigned to
the answers that tend to be accepted (in this example, we use
the information that longer answers are more likely to be
accepted, hence we sort in descending order). For the example
of Table 1, the answer with the greatest Length will receive a
Length_D value of 1 (answer Id 6 with length 250), while the
answer that comes second receives a value of 2 (answer Id 7
with length 200; note that we denote the discretised form of
each feature as feature_D). The result of this process is the
introduction of an equal number of linguistic features without
the usage of any further information (apart from the
association between a question and its corresponding
answers)^3.
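
The grouping, sorting and discretisation steps described above
can be sketched as follows, using the five answers of Table 1.
This is an illustrative sketch rather than the implementation
evaluated in this paper: the function and field names are
hypothetical, and ties within a question (which do not occur in
Table 1) are broken by input order, a choice the text does not
specify.

```python
from collections import defaultdict


def discretise(answers, feature, descending=True):
    """Map each answer id to its rank among answers to the same question.

    Rank 1 goes to the largest value when descending=True, and to the
    smallest value otherwise.
    """
    # Grouping: collect the answers submitted under each question.
    groups = defaultdict(list)
    for answer in answers:
        groups[answer["question_id"]].append(answer)

    ranks = {}
    for group in groups.values():
        # Sorting: order the competitors by the chosen feature.
        # sorted() is stable, so ties keep their input order (assumption).
        ordered = sorted(group, key=lambda a: a[feature], reverse=descending)
        # Discretisation: ranks start at 1 and increase by 1.
        for rank, answer in enumerate(ordered, start=1):
            ranks[answer["answer_id"]] = rank
    return ranks


# The answers of Table 1 (question 1 has three answers, question 5 has two).
table_1 = [
    {"question_id": 1, "answer_id": 2, "length": 200},
    {"question_id": 1, "answer_id": 3, "length": 150},
    {"question_id": 1, "answer_id": 4, "length": 250},
    {"question_id": 5, "answer_id": 6, "length": 250},
    {"question_id": 5, "answer_id": 7, "length": 200},
]

length_d = discretise(table_1, "length")
```

Here `length_d` reproduces the Length_D column of Table 1.
Passing `descending=False` would cover features for which
smaller values are expected to be preferred.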

2. EVALUATION 

Table 2 presents the results when using different sets of
features and 10-fold cross-validation. The table contains the
average values for 21 SE websites (including SO) as the output
of different evaluations on 4 million questions and more than 8
million answers. Initially, we use the absolute values of
textual features, with low results (58% precision, Case 1).
Cases 2 and 3 both utilise the discretised features, while
Case 3 additionally uses the other set of features (i.e.
AnswerCount and CreationDate). Cases 2 and 3 constitute our
proposed prediction method. Case 4 refers to a “traditional”
approach that relies on plain linguistic features and
user-reputation ratings. We can see that, although a whole new
set of features is added to the dataset, classification
performance remains lower than in Case 3, which is
linguistics-based. Case 5 keeps the user ratings in addition to
incorporating all features of Case 3. As a result, its
classification performance is the highest among Cases 1 to 5
in terms of F-Measure and AUC, but almost identical to Case 3,
which is strictly based on content and discretisation
(F-Measure 0.77 vs. 0.76, AUC 0.88 vs. 0.87). Finally, Case 6
uses all previous features together with the answer ratings,
i.e. user-entered scores, and manages to outperform all of the
previous cases. Case 6 shows that the information contained
within answer ratings is independent, to a certain extent, of
the information found in the previous features.

^3 Note that other approaches typically omit this information.

Table 2: Results for best answer prediction using different
sets of features (Cases 1 to 6) for all SE websites. Columns
show macro-averaged precision (P), recall (R), F-Measure (FM)
and Area Under Curve (AUC) for all 21 SE websites using 10-fold
validation.

No.  Features Used                                              P     R     FM    AUC
1    Linguistic                                                 0.58  0.60  0.56  0.60
2    Linguistic & Discretisation                                0.81  0.70  0.74  0.84
3    Linguistic & Discretisation & Other                        0.84  0.70  0.76  0.87
4    Linguistic & Other & User Rating (no discretisation)       0.82  0.69  0.75  0.86
5    Linguistic & Other & User Rating (with discretisation)     0.82  0.72  0.77  0.88
6    All features (Answer and User Rating with discretisation)  0.88  0.85  0.86  0.94
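
Because the figures in Table 2 are macro averages, the reported
F-Measure need not equal the harmonic mean of the reported P
and R (in Case 1 it is lower than both). The following is a
minimal sketch of one way this can arise, assuming that each
metric is computed per website and then averaged with equal
weight. The function names are hypothetical, and the per-site
values are invented purely for illustration; they are not taken
from our experiments.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(per_site):
    """Average P, R and F-Measure over sites, weighting every site equally."""
    n = len(per_site)
    avg_p = sum(site["p"] for site in per_site) / n
    avg_r = sum(site["r"] for site in per_site) / n
    # The F-Measure is computed per site first, then averaged.
    avg_f = sum(f_measure(site["p"], site["r"]) for site in per_site) / n
    return avg_p, avg_r, avg_f


# Two hypothetical sites with opposite precision/recall trade-offs.
sites = [{"p": 0.9, "r": 0.3}, {"p": 0.3, "r": 0.9}]
p, r, f = macro_average(sites)
```

With these made-up inputs the averaged F-Measure falls below
both the averaged precision and the averaged recall, since each
site's harmonic mean is pulled towards its weaker metric.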

In summary, the results in Table 2 show that the discretisation
of linguistic features significantly outperforms the classifier
based on linguistic features only (F-Measure 0.74 vs. 0.56,
Case 2 vs. Case 1). Moreover, we can also see that user rating
features such as reputation barely improve our classification
(Case 5 vs. Case 3), a sign that discretisation extracts very
useful information and delivers very strong results.

The whole approach described here has been implemented and is
offered for free as a web browser plugin and a web service
(https://acqua.kmi.open.ac.uk). In the future, we intend to
explore the applicability of our methodology elsewhere and to
investigate further the effect of textual quality on answer
selection and impact in online fora and social media.